"FINGER-POINTER": POINTING INTERFACE
BY IMAGE PROCESSING

Masaaki Fukumoto, Yasuhito Suenaga and Kenji Mase

NTT Human Interface Laboratories, Yokosuka, 238-03 Japan
Abstract: We have developed an experimental system for the 3D direct pointing interface "Finger-Pointer."

visual channels by introducing the "Timing-Tag" technique.



1. INTRODUCTION 

We think that there are two types of man-machine 
interface. One is the "professional" interface, which 
pursues high interaction speed, and the other is the 
"common" interface, which doesn't require any practice
to use. The professional interface must be able to
transmit the operator's intention to a machine rapidly
and accurately; the need for practice or the
nuisance of setup is only a minor problem for this kind of
interface. On the other hand, the common interface
must be usable by all people simply and easily. Accordingly,
the practice needed to operate this interface must
be reduced as much as possible. The keyboard, the 
most prevalent computer interface, is useful for com- 
mand and character input, but it requires the user to 
practice a lot if he is to become fluent. We think in- 
formation systems that will be used by everyone, such 
as the general information terminal supported by
intelligent computer agents, must employ a common
interface instead of professional interfaces such as
keyboards. In particular, common interfaces that adopt
the interaction style used in ordinary human-to-human
conversation don't require any practice to use, and
everyone can operate them easily.

Human-to-human interaction is composed of verbal
and nonverbal modes [1]. The role of the nonverbal
mode, which encompasses posture, gesture, gaze, facial
expression and so on, is as important as that of the
verbal mode. We have proposed the interface concept
named "Human Reader" [2], which integrates verbal
and nonverbal interaction modes mainly by using image
processing techniques. A Human Reader consists
of several recognition modules. The "Head Reader" [3]
recognizes human head motion such as brief responses
(Yes/No). The "Face Reader" [4] understands human
facial actions. In this paper, we focus on human gestures,
which are extremely expressive among the nonverbal
modes, and use them to construct a human-computer
interface.

Human gestures can be classified into four groups
(Table 1). The first gesture group contains the pointing
actions used to indicate 2D or 3D location; we call this
the "locator" group. The next group, which we call
"switcher," includes gestures that select between two
or more states. The next group, named "valuator,"
contains gestures that indicate quantity (e.g., "about
this size" or "rotate this much"). The last group comprises
gestures that visualize shapes, actions or feelings
such as "triangle" or "running"; we call this group
"imager." Sign languages and body languages belong
in this group. This imager group has strong expressive
power, but the difficulty of accurate recognition is
correspondingly increased.

Several interface prototype systems have been proposed.
The multi-modal graphics interface [5] employs
"locator" gestures for pointing input. The finger-based
commands used in many virtual reality systems [6]
and the finger spelling [7] used by blind people belong
in the "switcher" group. Some object manipulation
gestures used in computer aided design tools [8] belong
in the "valuator" group. The few systems that recognize
sign language gestures [9, 10], which have a relatively
strict grammar, belong to the "imager" group.

In these systems, however, the operator is forced to
wear special devices, such as a Data-Glove or a magnetic
sensor. Some approaches using image processing
methods free the operator from these devices. For example,
the object handling system [11] recognizes
pointing direction and hand forms with stereo TV cameras
mounted above and in front of the operator, and
the visual interface system [12] recognizes hand signs
with a single TV camera in front of the operator. These
systems, however, cannot work in real-time or need
special image processing hardware.

Human beings normally interact by using plural
communication modes simultaneously, such as voice
and gesture. The synchronization and integration of
parallel input modes are necessary for realizing a
multi-modal computer interface. Some prototype systems
using pointing and hand gestures [6, 13] accept
multi-modal input messages such as a combination of
pointing gestures and voice commands. In these systems,
however, the problem of synchronizing the input
modules, all of which have different recognition speeds,
has not been solved.

Table 1. Classification of gestures

Class      Content                      Example
Locator    Indicate location in space   Pointing
Switcher   Select from some states      Hand spelling
Valuator   Indicate extents             Object manipulation
Imager     Indicate general images      Sign language, body language

As the prototype of a gesture interface, we developed
the human pointing action recognition system called
"Finger-Pointer." Gestures used in this system belong
to the "locator," "switcher" and some "valuator"
groups. By using a simple and fast image processing
method, the system can recognize 3D pointing actions
and simple hand forms in real-time without forcing
the user to wear any special device. The operator can
interact with the system by a combination of pointing
gestures and voice commands without concern for the
time lag of each input channel. The next section outlines
the system construction of the "Finger-Pointer." Fast
image processing methods for hand image detection
are then described. Next, the new pointing direction
determination method called "Virtual Projection
Origin (VPO)" is proposed. Experiments have shown that
VPO is very effective in various situations. Next,
multi-channel synchronization using "Timing-Tags" is
described. Finally, remaining problems and possible
applications of this system are discussed.

2. FINGER-POINTER

2.1. System concept

A main purpose of the Finger-Pointer system is to 
allow the user to communicate with various machines 
such as presentation or audio-visual instruments with 
pointing actions, hand forms and voice commands in 
a meeting space or a living room. Figure 1 shows the 
concept of the Finger-Pointer system. The operator's
pointing actions are captured by two stereoscopic TV
cameras: one mounted on the wall and the other on
the ceiling. The system determines the 3D coordinates
of the operator's finger tip by analyzing the camera
images and uses the pointing direction as "locator."
The system can also recognize several simple hand
forms as "switcher" or "valuator" by analyzing the
image captured by the camera mounted on the wall.
The operator can communicate with the system using
a natural combination of voice and gestures. In order
to achieve more accurate pointing, the system can display
a pointing cursor on the front screen to provide
feedback to the user.

2.2. Structure of the system

Figure 2 shows a block diagram of the system. This
system employs two monochrome CCD cameras
driven by one "sync" signal. The ceiling camera image
is converted into the "R" plane of the digitizing unit,
and the wall camera image is converted into the "G"
plane. These camera images are then digitized by the
video digitizing unit of the graphic work station
(GWS). The operator's voice level is also digitized in
the "B" plane and used to generate "Timing-Tags"
(described later) for voice and gesture synchronization.

The system works on the GWS and processes 10
frames per second without any special image processing
hardware. The user-specified, separated-word-type
voice recognition unit (Voice Navigator) on a personal
computer (Macintosh IIfx) is used for voice command
recognition. By using a telescopic type microphone,
the operator doesn't need to wear even a headset.
Another GWS (Personal IRIS) and a Hi-scanned Video
Projector are used as the application platform.
Fig. 1. The operator's pointing actions are captured by two TV cameras.
3. IMAGE PROCESSING METHODS
3.1. Finger-tip detection for "Locator"

Figure 3 illustrates the method of determining finger
tip location, and its algorithm is described below.

1. Binarize the two images captured by the ceiling and
the wall cameras with a fixed threshold, and extract
hand regions.

2. Scan each binary image and determine the pixel
that is closest to the screen, as the most likely
candidate for the finger tip.

3. Calculate the 3D position of the candidate pixel from
the location of the candidate pixel in each image
and camera parameters.

4. Decide whether that candidate pixel represents the
real finger tip, based upon the length and thickness
of the extracted region. (Length and thickness of
index fingers and camera parameters are
predetermined values.)

The system makes the following assumptions to assist
finger-tip detection:

• When using a reflected light source, the hand region
is lighter than the background region (for backlighting,
it's darker).

• The operator's finger tip is the part of his body nearest
the screen while he is pointing.

• The length and thickness of human fingers are not
drastically different (no calibration is necessary for
each operator).

Furthermore, the position at which the finger will
next appear is estimated from the two most recent finger
tip positions for speedy processing. If the operator's
finger tip is included in the tracking area (about 8% of
the captured image), the system can detect the finger
tip candidate pixel quickly. After the finger tip location is
detected, the system determines the pointing direction
by the use of the "Virtual Projection Origin" (described
later).
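The candidate search in steps 1 and 2 and the two-point position estimate can be sketched in Python as follows. This is only a minimal illustration, not the authors' implementation: the function names, the list-of-lists image format, and the assumption that the screen lies beyond the top edge of the image (row 0) are all hypothetical.

```python
def binarize(image, threshold):
    """Return a 0/1 image: 1 where a pixel is brighter than the fixed threshold."""
    return [[1 if px > threshold else 0 for px in row] for row in image]


def fingertip_candidate(binary):
    """Return (row, col) of the hand pixel closest to the screen, or None.

    Assumption: the screen lies beyond row 0, so the first hand pixel met
    when scanning the rows from the top is the finger-tip candidate.
    """
    for r, row in enumerate(binary):
        for c, px in enumerate(row):
            if px:
                return (r, c)
    return None


def predict_next(prev, curr):
    """Linearly extrapolate the next position from the two most recent ones."""
    return tuple(2 * c - p for p, c in zip(prev, curr))


# Hypothetical 4x4 grey-level image with a bright "finger" pointing upward.
image = [
    [10, 10, 10, 10],
    [10, 200, 10, 10],
    [10, 220, 210, 10],
    [10, 230, 220, 10],
]
binary = binarize(image, threshold=128)
candidate = fingertip_candidate(binary)
```

Restricting the scan to a small window around `predict_next(...)` is one way to realize the tracking area mentioned above; the length and thickness check of step 4 is not covered by this sketch.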

3.2. Thumb-click detection for "Trigger"

Some trigger action is necessary to extract a specific
pointing direction because the pointing direction
is continuously detected. Conceivable trigger actions
are (a) existence of finger or hand, (b) static finger
position in some predetermined period, (c) the use of
particular finger tip motion (example: drawing a small
circle). All three of these actions reduce the pointing
accuracy and operating speed. In addition, action
classification (separation of "trigger" from "locator") is
necessary. It is desirable that the "trigger" action be
independent of the "locator" action, if possible.
The Finger-Pointer system employs the thumb bending
action as "trigger" and index finger direction as
"locator"; these two actions are basically independent.
With appropriate system functions, the operator can
use a click and drag function, similar to that possible
with a one-button mouse.

Fig. 3. Determining finger-tip location. Scan for the finger tip
candidate that is closest to the screen, and confirm ...
(Labels: rf: index-finger length; df: hand region length on the
scan circle; wf: index-finger thickness; finger-tip candidate
pixel; scan direction toward screen.)

Fig. 4. Thumb-switch detection. Scan for the thumb tip candidate in
a spreading fan pattern from the line determined by the finger tip
and wrist position. (Labels: rt: thumb length; dt: hand region
length on the scan circle; wt: thumb thickness; rr: length between
finger tip and wrist; thumb-tip candidate pixel; wall-camera image.)

The thumb scanning method is similar to the 
method used to detect the finger tip (Fig. 4). 

1. Scan the binarized wall camera image and determine
the wrist center.

2. Scan the image in a spreading fan pattern from the
line determined by the operator's finger tip and wrist
position, and determine the uppermost pixel of the
hand region in the scanned area as the candidate
for the thumb tip.

3. Decide whether that candidate pixel represents the
real thumb tip, based upon the length and thickness
of the extracted region. (Thumb length and thickness
are predetermined values.)

3.3. Finger-number detection for "Valuator" 

People often use their fingers to indicate numbers. 
We call this gesture "Finger-Number." The Finger- 
Pointer system can also recognize the number of out- 
stretched fingers. The operator can communicate with 
the system by displaying different numbers of fingers. 
This feature allows faster selection than pointing
to icons.

The recognition sequence is shown in Fig. 5. 

1. Determine the scan circle center from the finger tip and
wrist center of the binarized wall camera image.

2. Sweep the scan circle and separate finger regions 
on the circle. 

3. Decide how many fingers occupy each extracted 
region. 

When using a low resolution camera, the boundary 
of neighboring fingers becomes indistinct. Thus rec- 
ognizing finger numbers by counting isolated fingers 
would be inaccurate. The proposed method detects the 
correct number of fingers, even if two or more fingers 
are held together. 
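Steps 2 and 3 can be illustrated with a short Python sketch. It assumes that the scan circle has already been sampled into a sequence of 0/1 values, and that the number of fingers in a region is obtained by dividing the region length by a predetermined finger width; the paper does not state the exact decision rule, so this division and all names below are assumptions made for illustration.

```python
def hand_runs(samples):
    """Lengths of the consecutive runs of hand pixels (1s) along the scan."""
    runs, length = [], 0
    for s in samples:
        if s:
            length += 1
        elif length:
            runs.append(length)
            length = 0
    if length:
        runs.append(length)
    return runs


def count_fingers(samples, finger_width):
    """Estimate the finger count from the run lengths on the scan circle.

    Each run is divided by the predetermined finger width (in samples) and
    rounded to the nearest integer, so two fingers held together count as
    two and runs much narrower than a finger are ignored.
    """
    return sum(int(run / finger_width + 0.5) for run in hand_runs(samples))


# Hypothetical scan: one isolated finger and two fingers held together.
scan = [0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0]
fingers = count_fingers(scan, finger_width=3)
```

The sketch treats the scan as an open arc that starts on background; a full circle would also need to merge a run that wraps around the start point.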
3.4. Lighting by infrared LED

The Finger-Pointer system employs fixed
threshold binarization of camera images to achieve
real-time processing. However, the performance of
binarization with a fixed threshold is influenced by
lighting conditions. A strong visible light is effective
for stable binarization, but such a light makes the
operator feel uncomfortable. This problem can be solved
by using arrays of infrared LEDs as light sources; their
light is unnoticeable to the operator. Filters that
eliminate the visible spectrum are positioned in front of
the CCD cameras. With this combination of infrared
LEDs and filters, the system can achieve stable
binarization with a fixed threshold regardless of the lighting
condition of the room.

4. "VPO": VIRTUAL PROJECTION ORIGIN
4.1. Pointing direction 

The operator's pointing direction is determined by 
a straight line that is defined by two points in 3D
space. We call these two points "Tip-Point" and
"Base-Point." The Tip-Point corresponds to the operator's
finger tip. However, the question is the location of the
Base-Point. A preliminary experiment indicated that
the position of the Base-Point is different for each
operator. Even for the same operator, this point changes
depending on the pointing style, for example, whether
the user is tense or relaxed.

4.2. Virtual projection origin

For the Finger-Pointer system we employed a virtual
Base-Point estimated by simple calibration before
interaction. Therefore, the system can unify the user's
desired pointing direction and the direction perceived
by the system. We assume that the lines of pointing
direction converge at one (Base) point when the
operator points at several objects on a distant screen
(Fig. 6). After this calibration, the operator's pointing
direction can be expressed as the projection from the
converged point through the operator's finger tip
(Fig. 7). We call this point the "VPO" (Virtual Projection
Origin).
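The projection from the VPO through the finger tip onto the screen amounts to a ray-plane intersection. The Python sketch below illustrates it under an assumed coordinate system in which the screen is the plane z = screen_z and the operator stands at smaller z; the function name and the coordinate convention are not taken from the paper.

```python
def point_on_screen(vpo, tip, screen_z):
    """Intersect the ray from the VPO through the finger tip with the screen.

    vpo and tip are (x, y, z) positions; the screen is assumed to be the
    plane z = screen_z. Returns the (x, y) cursor position on the screen,
    or None if the ray is parallel to the screen or points away from it.
    """
    dz = tip[2] - vpo[2]
    if dz == 0:
        return None
    t = (screen_z - vpo[2]) / dz
    if t <= 0:
        return None
    x = vpo[0] + t * (tip[0] - vpo[0])
    y = vpo[1] + t * (tip[1] - vpo[1])
    return (x, y)


# Hypothetical positions in metres: VPO at the origin, finger tip 0.5 m ahead.
cursor = point_on_screen((0.0, 0.0, 0.0), (0.1, 0.05, 0.5), screen_z=5.0)
```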

The VPO calibration procedure is shown in Fig. 6.

1. Display a predetermined mark at the upper-right corner
of the front screen.

2. Measure the operator's finger tip position when the
operator points at the mark, and determine the pointing
line from the mark to the finger tip position.

3. Repeat this procedure and determine pointing lines
for several other positions.

4. Estimate the VPO as the point at which these pointing
lines converge.

Fig. 6. VPO calibration. Estimate the convergence point
(VPO) of pointing lines passing from the displayed marks
through the corresponding finger tip position.

Fig. 7. Pointing by VPO. The operator's pointing direction is
determined as a projection from the VPO through his finger
tip.

The VPO is the center of a sphere of minimum ra- 
dius that is intersected by all pointing lines (Fig. 8). 
The sphere's radius indicates the convergence rate, and
a small radius means good convergence (and thus ac- 
curate estimation). 
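As a rough illustration of the convergence-point estimate, the Python sketch below finds the point that minimizes the sum of squared distances to all pointing lines, and reports the largest line distance as a convergence radius. Note that this least-squares point is a substitute technique chosen for simplicity; it approximates, but is not identical to, the center of the minimum sphere described above. All names are hypothetical.

```python
import math


def _unit(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]


def _solve3(a, b):
    """Solve a 3x3 linear system by Gauss-Jordan elimination with pivoting."""
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for i in range(3):
        p = max(range(i, 3), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(3):
            if r != i:
                f = m[r][i] / m[i][i]
                m[r] = [x - f * y for x, y in zip(m[r], m[i])]
    return [m[i][3] / m[i][i] for i in range(3)]


def line_distance(x, p, d):
    """Distance from point x to the line through p with unit direction d."""
    w = [xi - pi for xi, pi in zip(x, p)]
    along = sum(wi * di for wi, di in zip(w, d))
    perp = [wi - along * di for wi, di in zip(w, d)]
    return math.sqrt(sum(c * c for c in perp))


def estimate_vpo(marks, tips):
    """Least-squares convergence point of the mark-to-finger-tip lines.

    Solves sum(I - d d^T) x = sum((I - d d^T) p) over all lines, where p is
    a finger tip position and d the unit direction toward its mark.
    Returns (vpo, radius), radius being the largest distance to any line.
    """
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    dirs = []
    for mark, tip in zip(marks, tips):
        d = _unit([m - t for m, t in zip(mark, tip)])
        dirs.append(d)
        for i in range(3):
            for j in range(3):
                proj = (1.0 if i == j else 0.0) - d[i] * d[j]
                a[i][j] += proj
                b[i] += proj * tip[j]
    vpo = _solve3(a, b)
    radius = max(line_distance(vpo, t, d) for t, d in zip(tips, dirs))
    return vpo, radius
```

The system is singular when all pointing lines are parallel, which is one reason to spread the calibration marks over the screen.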

4.3. Distribution of VPO 

Figure 9 illustrates experimentally measured VPO 
distributions. The experiment tested 20 operators, and 
the distance between the operator and the 120-inch 
diagonal screen was 5.4 m. Each filled circle indicates 
the VPO position for one operator. The radius of each 
circle indicates the minimum sphere size intersected 
by all pointing lines (a small circle means good
convergence). The figure shows that VPO position differs
for each operator, and even for the same operator, the 
VPO position changes with the pointing style. 

The experiment showed that the VPO of each operator
converges within a 3.5-cm radius with a probability
of 95%. By using the VPO method, the system
has a pointing accuracy of 2.0° without cursor feedback,
and 0.6° with cursor feedback, for all operators
and pointing styles.
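To give a feel for these angles, the short Python calculation below converts an angular error into a displacement on the screen. It assumes, as a simplification not stated in the paper, that the error is measured from the operator's position at the 5.4 m distance used in the experiment and that the pointing line is perpendicular to the screen.

```python
import math


def screen_error(distance_m, angle_deg):
    """Displacement on the screen caused by a given angular pointing error."""
    return distance_m * math.tan(math.radians(angle_deg))


without_feedback = screen_error(5.4, 2.0)  # roughly 0.19 m
with_feedback = screen_error(5.4, 0.6)     # roughly 0.06 m
```

Under these assumptions, cursor feedback reduces the error on the screen from about 19 cm to about 6 cm.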



5. CHANNEL SYNCHRONIZATION BY "TIMING-TAG"
5.1. Multi-modal pointing

Human beings normally point using voice and gestures
simultaneously. In this case, the finger is used as
a "locator," and the voice is used as a "trigger." The
combination of finger and voice provides more natural
interaction than hand gestures alone. For example, using
voice allows the operator's hand to be raised just briefly
for pointing, so fatigue is less than with thumb-switch
triggering.

The Finger-Pointer provides integration of voice and
pointing gesture by the use of a (user-specified,
separated-word-type) speech recognition unit. The system
integrates voice commands, pointing targets, thumb
triggers, and finger numbers, and decides actions for
the target. A typical combination example for a
presentation application is shown in Table 2.

Fig. 8. Estimating convergence point. The VPO is estimated as the
center of a sphere that has minimum radius and is intersected by
all pointing lines. (Labels: convergence radius (cr); Virtual
Projection Origin (convergence point).)

Fig. 9. VPO distribution. Each filled circle indicates the VPO
position for one operator. The radius of each circle indicates
convergence rate (small circle means good convergence).
(Labels: operator number: 20; 95% convergence radius within 3.5 cm.)

5.2. Symbol mismatch

In the Finger-Pointer system, pointing recognition 
can be completed in real-time, but the voice recognition 
unit needs a delay of about 0.5 seconds after the voice 
command is vocalized, and the delay changes depend- 
ing on the size of the word dictionary and the word 
input. In general, inter-channel synchronization is
necessary to integrate plural input channels. Each input 
channel has its own recognition module, and each 
module has a different recognition delay. Moreover, 
in many recognition modules, these delay times change 
depending on the input signals. If a group of events 
that take place at the same time are individually cap- 
tured by each channel, the outputs of the recognition 
modules are randomly offset against each other. 
Therefore, the message integrator can't decide which 
symbols must be combined (Fig. 10). 



5.3. Timing-Tags 

The Finger-Pointer system realizes inter-channel 
synchronization by introducing "Timing-Tags." Figure 
11 shows the concept of the Timing-Tag; its algorithm
is described below.

• Each recognition module and message integration 
module are driven by a master clock. 

• When a recognition module receives a pointing 
event, it identifies the event with the current clock 
value, called the "Timing-Tag." 

• After the recognition process, recognized symbols 
are passed to the integrator together with their cor- 
responding timing-tags. 

• The message integration module rearranges the rec- 
ognized symbols according to their timing-tags. 
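The rearrangement step can be sketched in Python as follows. The sketch assumes that each recognized symbol carries the master-clock time of its originating event, and that symbols whose tags lie within a small window belong to the same message; the `Symbol` class, the `integrate` function and the 0.2 s window are illustrative assumptions, not details given in the paper.

```python
from dataclasses import dataclass


@dataclass
class Symbol:
    channel: str   # e.g. "pointing" or "voice"
    value: str     # recognized symbol
    tag: float     # timing-tag: master-clock time of the original event (s)


def integrate(symbols, window=0.2):
    """Sort symbols by timing-tag and group those that occurred together.

    A symbol joins the current group if its tag is within `window` seconds
    of the first symbol of that group; otherwise it starts a new group.
    """
    groups = []
    for s in sorted(symbols, key=lambda s: s.tag):
        if groups and s.tag - groups[-1][0].tag <= window:
            groups[-1].append(s)
        else:
            groups.append([s])
    return groups


# Symbols listed in arrival order: voice results arrive late because of the
# recognition delay, but their tags still refer to the time of utterance.
arrived = [
    Symbol("pointing", "target A", 1.00),
    Symbol("pointing", "target B", 1.60),
    Symbol("voice", "this", 1.05),
    Symbol("voice", "that", 1.62),
]
messages = integrate(arrived)
```

Because grouping depends only on the tags, the order and delay with which the recognition modules deliver their results do not affect which symbols are combined.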

Through the use of timing-tags, the processing delay
of each recognition module can be neglected, and the
system can quickly integrate multi-channel pointing
messages. Conventional voice recognizers, however,
can't generate timing-tags. Thus, the timing-tag method
is implemented in the system as a backsearching voice
level mechanism (Fig. 12). When the recognized symbol
is output, the system backsearches the recorded
voice level buffer and finds the start and end times of
the corresponding voice command. The timing-tag is
generated from these times.

Table 2. Combination of gesture and voice commands for a presentation.

Pointing       Command (trigger, switcher)   Voice command sample
Index finger   Thumb click                   -
Index finger   Voice                         "This," "that," "from," "to"
-              Voice + finger number         "Forward this many," "Back this many"
-              Voice                         "Clear screen," "calibration"

Fig. 10. Symbol mismatch. The message integrator cannot ... because
of the different recognition delays of each module. (Labels: Ch.1,
e.g., pointing; Ch.2, e.g., voice; Ch.3, e.g., gaze; A: "That
Point"; B: "This Direction"; C: "That Window".)
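The backsearching voice level mechanism of Section 5.3 might look like the following Python sketch. It assumes that the voice level is stored once per video frame, and that an utterance is simply a run of frames above a fixed level threshold; the function name, the buffer layout and the threshold are assumptions for illustration only.

```python
def backsearch_utterance(levels, output_index, threshold):
    """Find the most recent utterance before the recognizer produced output.

    levels: recorded voice levels, one value per frame.
    output_index: frame at which the recognized symbol was output.
    Returns (start, end) frame indexes of the utterance, or None if the
    buffer holds no frame above the threshold.
    """
    i = output_index
    # Skip the silent frames that correspond to the recognition delay.
    while i >= 0 and levels[i] <= threshold:
        i -= 1
    if i < 0:
        return None
    end = i
    # Walk back through the voiced frames to the start of the utterance.
    while i >= 0 and levels[i] > threshold:
        i -= 1
    return (i + 1, end)


# Hypothetical buffer: speech in frames 2 to 4, symbol output at frame 9.
levels = [0, 0, 5, 7, 6, 0, 0, 0, 0, 0]
span = backsearch_utterance(levels, output_index=9, threshold=2)
```

Multiplying the frame indexes by the frame period gives the start and end times from which a timing-tag can be formed.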

5.4. Target detection by pointing velocity

When using voice commands as a trigger, it is necessary
to extract a specified pointing direction from
several pointing directions measured while the voice
command is spoken. We noticed that the finger tip
momentarily halts when it points at the targets.
Consequently, the system can extract the correct pointing
direction by determining the minimum velocity of the
finger tip adjacent to the vocalization of a voice
command (Fig. 13). The processing speed of our prototype
system, unfortunately, cannot follow quick pointing
actions. Therefore, the system estimates the target locus
by interpolation of pointing directions.
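The minimum-velocity selection can be sketched in Python as below. The sketch assumes a list of time-stamped finger tip positions and a voice command interval given by its start and end times; the search margin around the utterance and all names are assumptions, and the interpolation step mentioned above is not included.

```python
import math


def target_sample(times, tips, start, end, margin=0.2):
    """Index of the finger tip sample with the lowest speed near an utterance.

    times: capture time of each sample (s); tips: (x, y, z) positions.
    Only samples captured between start - margin and end + margin are
    considered. Returns None if no sample falls inside that interval.
    """
    best, best_speed = None, float("inf")
    for i in range(1, len(tips)):
        if not (start - margin <= times[i] <= end + margin):
            continue
        dt = times[i] - times[i - 1]
        if dt <= 0:
            continue
        speed = math.dist(tips[i], tips[i - 1]) / dt
        if speed < best_speed:
            best, best_speed = i, speed
    return best


# Hypothetical track at 10 frames per second: the finger slows near x = 5.
times = [0.0, 0.1, 0.2, 0.3, 0.4, 0.5]
tips = [(0, 0, 0), (3, 0, 0), (5, 0, 0), (5.1, 0, 0), (5.6, 0, 0), (8, 0, 0)]
chosen = target_sample(times, tips, start=0.2, end=0.4)
```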

6. APPLICATIONS 

We constructed three applications based on the Fin- 
ger-Pointer system. 






6.1. Presentation system

Figure 14 illustrates a presentation system that uses
a computer-based slide projector. Rectangular regions
at the bottom of the screen serve as command buttons,
for example, NextPage, PrevPage, ClearScreen, etc. The
operator can select commands and emphasize the slide
image in real-time by adding marks and lines. The
operator can control the system by using several
combinations of gesture and voice commands (Table 2).

6.2. Video browser 

"Finger-Pointer" can also be used as a video browsing
system (Fig. 15). The operator can use hand motions
and thumb-switch actions to control a VCR, for
example: Play, Stop, and some special search operations
similar to those offered by a "Shuttle-Ring."

6.3. "Space-writer" 

The system can detect alphanumerics and graphic
figures written in space by the operator (Fig. 16). The
pen-up/down operation is controlled by the operator's
thumb-switch.

Fig. 15. Video browser. The operator can control a VCR by hand motions and thumb switching.

Fig. 16. Drawing characters in space. The system can detect letters and simple figures written in space by
finger. (Labels: Draw: thumb down; Move: thumb up.)

7. CONCLUSION

In this paper, we have introduced the new pointing
action recognition system called Finger-Pointer. It does
not force the operator to wear special devices and
realizes a more human-friendly interface. By introducing
the notion of the VPO (Virtual Projection Origin), the
system can detect stable and accurate pointing regardless
of the operator's pointing style. The system also
realizes multi-channel message synchronization by
introducing Timing-Tags for integrating several
recognition modules, all of which have different processing
delays. The system operates in real-time without any
special image processing hardware, so it is useful as
the platform of a multi-modal interface for many
applications such as presentation, navigation, machine
control, and so on.

The current prototype system cannot cope with operator
movement after calibration. But the notion of
the VPO will be applicable for pointing with movement
by tracking the operator with TV cameras or other
sensors, and determining the VPO relative to the
operator's location. Improvement of processing speed
together with higher pointing accuracy and the recognition
of more complex hand gestures still remain as
future problems.